So data has been collected, cleaned and organized. Lets do a short introduction to what we’re working with here.
aei <- read.csv2("/Volumes/RachelExternal/Thesis/Data/SiebertData/HID_v10/Country_tenyear_ha.csv")
aei <- as.data.frame(aei)
#rename the cols so that I can use them as number
colnames(aei) <- c("country", "ISO", "ID", "1900", "1910","1920","1930","1940","1950","1960","1970","1980","1985","1990","1995","2000","2005")
#tidying data to long format
aei <- pivot_longer(aei,cols = 4:17, names_to = "year", values_to = "aei_ha")
aei$aei_ha <- removePunctuation(aei$aei_ha)
aei$year <- as.numeric(aei$year)
aei$aei_ha <- as.numeric(aei$aei_ha)
aei$ISO <- as.factor(aei$ISO)
aei$country <- as.factor(aei$country)
#subset the data so that its from the year 1960
aei <- aei %>% subset(year >= 1960)
#make a new col for year count
aei$yearcount <- (aei$year - 1960)
#remove data for the total global irrigated area
globeaei <- subset(aei, country == "WORLD")
aei= aei[!aei$country == "WORLD",]
str(aei)
tibble [1,848 × 6] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 232 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 2 2 ...
$ ISO : Factor w/ 232 levels "","ABW","AFG",..: 3 3 3 3 3 3 3 3 6 6 ...
$ ID : int [1:1848] 2 2 2 2 2 2 2 2 6 6 ...
$ year : num [1:1848] 1960 1970 1980 1985 1990 ...
$ aei_ha : num [1:1848] 2404990 2408168 2518178 2594394 2903569 ...
$ yearcount: num [1:1848] 0 10 20 25 30 35 40 45 0 10 ...
Added to this to be able to calculate the % of land irrigated per country we need the country area. This data is taken from Worldbank.
#import the data set with total country area, this comes from the Worldbank
countryarea <- read.csv2("/Volumes/RachelExternal/Thesis/Data/LandArea/LandArea.csv")
#removing extra cols
countryarea <- countryarea[-c(265:270), ]
countryarea <- countryarea[, -c(1, 3:4)]
#fix the col names (in the two days since i found this little for loop I learned how to do this faster!)
for ( col in 2:ncol(countryarea)){
colnames(countryarea)[col] <- sub("X", "", colnames(countryarea)[col])
}
remove(col)
#making long data
countryarea <-
countryarea %>%
pivot_longer(!Country.Code, names_to = "year", values_to = "area_km") %>%
rename(ISO = Country.Code)
#fixing variable types
countryarea$ISO <- as.factor(countryarea$ISO)
countryarea$year <- as.numeric(countryarea$year)
str(countryarea)
tibble [16,104 × 3] (S3: tbl_df/tbl/data.frame)
$ ISO : Factor w/ 264 levels "ABW","AFG","AGO",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : num [1:16104] 1960 1961 1962 1963 1964 ...
$ area_km: num [1:16104] NA 180 180 180 180 180 180 180 180 180 ...
For some reason, the data from Worldbank on each countries area only starts in 1961. For this analysis, 1960 is needed. Country area from 1961 will be substituted for 1960, with the assumption that country area doesn’t change drastically in one year (it doesn’t).
head(subset(countryarea, c(year == 1960 | year == 1961))) #thought i knew how to use pipes, apparently not.
#fixing 1960
countryarea <-
countryarea %>%
group_by(ISO) %>%
fill(area_km, .direction = "up") %>%
ungroup()
Just merging the country area with the aei df here.
#combine these data sets and calculate the irrfrac per country
aei <- aei %>% merge(y = countryarea, by.x = c("ISO", "year"), all.x = TRUE)
rm(countryarea)
head(aei)
For some reason 1960 (the data starts in 1961) there is an issue with the registration of the total country area. Fixing it by using the value from 1970. Some countries also have no area value. I’ll go through and see if any of the countries with missing area values have non zero aei_ha. If so, I’ll replace them, as they didn’t come through with the WorldBank data set but are online. Copying and pasting from various sources should be fine.
#checking for countries with NA values in their area_km col
#looking primarily for countres that have non 0 aei_ha
errors <- aei[is.na(aei$area_km),]
#fixing counrties with no area
aei$area_km[which(aei$ISO == "GUF")] <- 83534 #French Guiana from Britannica
aei$area_km[which(aei$ISO == "GLP")] <- 1630 #Guadeloupe from Britannica
aei$area_km[which(aei$ISO == "MTQ")] <- 1128 # Martinique from Britannica
aei$area_km[which(aei$ISO == "REU")] <- 2512 #Reunion from Britannica
aei$area_km[which(aei$ISO == "SMK")] <- 13812 + 10905 + 77589 #Montenegro + Kosovo + Serbia from Britannica
aei$area_km[which(aei$ISO == "TWN")] <- 35980 #Tawian from CIA.gov
aei$area_km[which(aei$ISO == "SDN")] <- 1882000 #Sudan from UNDP
#check errors again then rm
rm(errors)
Now, transform the AEI_ha to km^2 and divide to get the % of total country area that is equipped for irrigation. There are still some countries that have 0 aei_ha and no area_km value. I think I will assign the irrfrac/perc 0 to those countries, but lets wait a little while to do that, most are islands.
aei <- aei %>% mutate(irrperc = ((aei_ha/100)/area_km)*100, irrfrac = ((aei_ha/100)/area_km))
Here is a quick histogram of the target variable. Lotta 0s.
hist(aei$irrperc, breaks = 20)
According to the FAO, the choice of which irrigation, and by proxy whether to irrigate or not, comes from six factors. The predictor that will be used for this analysis are listed below:
Economic
Topographical
Soil Type
Water Supply
Crop Type
Social Influences
All variables will be scaled or centered.
First, the ISO codes and Regions to be able to hold some structure.
#the ISO codes and regional data
iso <- read.csv2("/Volumes/RachelExternal/Thesis/Thesis/Data/GAPminder_ISOcodes.csv")
iso <- iso[, -c(2,7,10,12,13)] #removing extra info
#renaming some cols
iso <-
iso %>%
rename(ISO = GEO, country = name)
#fixing the variable types
iso$ISO <- as.factor(iso$ISO)
iso$country <- as.factor(iso$country)
iso$four_regions <- as.factor(iso$four_regions)
iso$eight_regions <- as.factor(iso$eight_regions)
iso$six_regions <- as.factor(iso$six_regions)
iso$six_regions <- as.factor(iso$six_regions)
iso$World.bank.region <- as.factor(iso$World.bank.region)
str(iso)
'data.frame': 197 obs. of 8 variables:
$ ISO : Factor w/ 197 levels "AFG","AGO","ALB",..: 1 3 50 4 2 8 6 7 9 10 ...
$ country : Factor w/ 197 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
$ four_regions : Factor w/ 4 levels "africa","americas",..: 3 4 1 4 1 2 2 4 3 4 ...
$ eight_regions : Factor w/ 8 levels "africa_north",..: 5 7 1 8 2 3 4 7 6 8 ...
$ six_regions : Factor w/ 6 levels "america","east_asia_pacific",..: 5 3 4 3 6 1 1 3 2 3 ...
$ Latitude : num 33 41 28 42.5 -12.5 ...
$ Longitude : num 66 20 3 1.52 18.5 ...
$ World.bank.region: Factor w/ 8 levels "","East Asia & Pacific",..: 7 3 5 3 8 4 4 3 2 3 ...
#future me heling past me
#fixing Macedonia, FYR to North Macedonia
iso$country <- as.character(iso$country)
iso[iso == "Macedonia, FYR"] <- "North Macedonia"
iso$country <- as.factor(iso$country)
Cool, we have multiple regions here that we could use in the future for some multilevel business. Also, latitude and longitude is here, which we can also try later. Obviously these lat and lon measurements will mean nothing for a country like China or the USA, but they are nice to have.
There are a lot of countries that have no data for the regions, which I will use in further analysis, so put these in there. Its going to be a mess.
na_regions <- aei[is.na(aei$four_regions),]
#Aruba
aei <- aei %>%
mutate(four_regions = replace(four_regions, ISO == 'ABW', 'americas'),
six_regions = replace(six_regions, ISO == 'ABW', 'america'),
eight_regions = replace(eight_regions, ISO == 'ABW', 'america_north')) %>%
#american samoa
mutate(four_regions = replace(four_regions,ISO == 'ASM', 'asia'),
six_regions = replace(six_regions,ISO == 'ASM', 'east_asia_pacific'),
eight_regions = replace(eight_regions,ISO == 'ASM', 'east_asia_pacific')) %>%
#bermuda
mutate(four_regions = replace(four_regions,ISO == 'BMU', 'americas'),
six_regions = replace(six_regions,ISO == 'BMU', 'america'),
eight_regions = replace(eight_regions,ISO == 'BMU', 'america_north')) %>%
#cote d'ivoire
mutate(four_regions = replace(four_regions,ISO == 'CIV', 'africa'),
six_regions = replace(six_regions,ISO == 'CIV', 'sub_saharan_africa'),
eight_regions = replace(eight_regions,ISO == 'CIV', 'africa_sub_saharan')) %>%
#democratic republic of congo
mutate(four_regions = replace(four_regions,ISO == 'COD', 'africa'),
six_regions = replace(six_regions,ISO == 'COD', 'sub_saharan_africa'),
eight_regions = replace(eight_regions,ISO == 'COD', 'africa_sub_saharan')) %>%
#republic of congo
mutate(four_regions = replace(four_regions,ISO == 'COG', 'africa'),
six_regions = replace(six_regions,ISO == 'COG', 'sub_saharan_africa'),
eight_regions = replace(eight_regions,ISO == 'COG', 'africa_sub_saharan')) %>%
#cayman islands
mutate(four_regions = replace(four_regions,ISO == 'CYM', 'americas'),
six_regions = replace(six_regions,ISO == 'CYM', 'america'),
eight_regions = replace(eight_regions,ISO == 'CYM', 'america_north')) %>%
#Faroe islands
mutate(four_regions = replace(four_regions,ISO == 'FRO', 'europe'),
six_regions = replace(six_regions,ISO == 'FRO', 'europe_central_asia'),
eight_regions = replace(eight_regions,ISO == 'FRO', 'europe_west')) %>%
#micronesia
mutate(four_regions = replace(four_regions,ISO == 'FSM', 'asia'),
six_regions = replace(six_regions,ISO == 'FSM', 'east_asia_pacific'),
eight_regions = replace(eight_regions,ISO == 'FSM', 'east_asia_pacific')) %>%
#gibraltar
mutate(four_regions = replace(four_regions,ISO == 'GIB', 'europe'),
six_regions = replace(six_regions,ISO == 'GIB', 'europe_central_asia'),
eight_regions = replace(eight_regions,ISO == 'GIB', 'europe_west')) %>%
#Guadeloupe
mutate(four_regions = replace(four_regions,ISO == 'GLP', 'americas'),
six_regions = replace(six_regions,ISO == 'GLP', 'america'),
eight_regions = replace(eight_regions,ISO == 'GLP', 'america_north')) %>%
#greenland
mutate(four_regions = replace(four_regions,ISO == 'GRL', 'europe'),
six_regions = replace(six_regions,ISO == 'GRL', 'europe_central_asia'),
eight_regions = replace(eight_regions,ISO == 'GRL', 'europe_west')) %>%
#french Guiana
mutate(four_regions = replace(four_regions,ISO == 'GUF', 'americas'),
six_regions = replace(six_regions,ISO == 'GUF', 'america'),
eight_regions = replace(eight_regions,ISO == 'GUF', 'america_south')) %>%
#guam
mutate(four_regions = replace(four_regions,ISO == 'GUM', 'asia'),
six_regions = replace(six_regions,ISO == 'GUM', 'east_asia_pacific'),
eight_regions = replace(eight_regions,ISO == 'GUM', 'east_asia_pacific')) %>%
#Kyrgyzstan
mutate(four_regions = replace(four_regions,ISO == 'KGZ', 'asia'),
six_regions = replace(six_regions,ISO == 'KGZ', 'europe_central_asia'),
eight_regions = replace(eight_regions,ISO == 'KGZ', 'asia_west')) %>%
#laos
mutate(four_regions = replace(four_regions,ISO == 'LAO', 'asia'),
six_regions = replace(six_regions,ISO == 'LAO', 'east_asia_pacific'),
eight_regions = replace(eight_regions,ISO == 'LAO', 'east_asia_pacific')) %>%
#Macedonia
mutate(four_regions = replace(four_regions,ISO == 'MKD', 'europe'),
six_regions = replace(six_regions,ISO == 'MKD', 'europe_central_asia'),
eight_regions = replace(eight_regions,ISO == 'MKD', 'europe_east')) %>%
#Martinique
mutate(four_regions = replace(four_regions,ISO == 'MTQ', 'americas'),
six_regions = replace(six_regions,ISO == 'MTQ', 'america'),
eight_regions = replace(eight_regions,ISO == 'MTQ', 'america_north')) %>%
#new caledonia
mutate(four_regions = replace(four_regions,ISO == 'NCL', 'asia'),
six_regions = replace(six_regions,ISO == 'NCL', 'east_asia_pacific'),
eight_regions = replace(eight_regions,ISO == 'NCL', 'east_asia_pacific')) %>%
#puerto rico
mutate(four_regions = replace(four_regions,ISO == 'PRI', 'americas'),
six_regions = replace(six_regions,ISO == 'PRI', 'america'),
eight_regions = replace(eight_regions,ISO == 'PRI', 'america_north')) %>%
#Palestine (West Bank + Gaza strip)
mutate(four_regions = replace(four_regions,ISO == 'PSE', 'asia'),
six_regions = replace(six_regions,ISO == 'PSE', 'middle_east_north_africa'),
eight_regions = replace(eight_regions,ISO == 'PSE', 'asia_west')) %>%
#french Polynesia
mutate(four_regions = replace(four_regions,ISO == 'PYF', 'asia'),
six_regions = replace(six_regions,ISO == 'PYF', 'east_asia_pacific'),
eight_regions = replace(eight_regions,ISO == 'PYF', 'east_asia_pacific')) %>%
#Reunion
mutate(four_regions = replace(four_regions,ISO == 'REU', 'africa'),
six_regions = replace(six_regions,ISO == 'REU', 'sub_saharan_africa'),
eight_regions = replace(eight_regions,ISO == 'REU', 'africa_sub_saharan')) %>%
#Serbia, Montenegro, Kosovo
mutate(four_regions = replace(four_regions,ISO == 'SMK', 'europe'),
six_regions = replace(six_regions,ISO == 'SMK', 'europe_central_asia'),
eight_regions = replace(eight_regions,ISO == 'SMK', 'europe_east')) %>%
#Slovakia
mutate(four_regions = replace(four_regions,ISO == 'SVK', 'europe'),
six_regions = replace(six_regions,ISO == 'SVK', 'europe_central_asia'),
eight_regions = replace(eight_regions,ISO == 'SVK', 'europe_east')) %>%
#Turks and Caicos Islands
mutate(four_regions = replace(four_regions,ISO == 'TCA', 'americas'),
six_regions = replace(six_regions,ISO == 'TCA', 'america'),
eight_regions = replace(eight_regions,ISO == 'TCA', 'america_north')) %>%
#East Timor
mutate(four_regions = replace(four_regions,ISO == 'TLS', 'asia'),
six_regions = replace(six_regions,ISO == 'TLS', 'east_asia_pacific'),
eight_regions = replace(eight_regions,ISO == 'TLS', 'east_asia_pacific')) %>%
#taiwan
mutate(four_regions = replace(four_regions,ISO == 'TWN', 'asia'),
six_regions = replace(six_regions,ISO == 'TWN', 'east_asia_pacific'),
eight_regions = replace(eight_regions,ISO == 'TWN', 'east_asia_pacific')) %>%
#British virgin Islands
mutate(four_regions = replace(four_regions,ISO == 'VGB', 'americas'),
six_regions = replace(six_regions,ISO == 'VGB', 'america'),
eight_regions = replace(eight_regions,ISO == 'VGB', 'america_north')) %>%
#US Virgin Islands
mutate(four_regions = replace(four_regions,ISO == 'VIR', 'americas'),
six_regions = replace(six_regions,ISO == 'VIR', 'america'),
eight_regions = replace(eight_regions, ISO == 'VIR', 'america_north'))
Next time I’ll write a function.
Income data is coming from Gapminder, as they have put real effort in to filling all the gaps in income and population for the last 150 years. Income is in terms of the 2011 USD.
#income here is 2011 usd
income <- read.csv("/Volumes/RachelExternal/Thesis/Thesis/Data/Gapminder_income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
income <- income[, c(1, 162:207)] #removing extra years
#making long data
income <-
income %>%
pivot_longer(!country, names_to = "year", values_to = "income")
#still have the xs there
income$year <- str_remove_all(income$year, "[X]")
#transformation of the variable types
income$year <- as.numeric(income$year)
income$country <- as.factor(income$country)
#log and standardize
income$log_income <- log(income$income)
income <- income %>%
mutate(log_income_std = log_income / mean(log_income))
#a summary to check that
str(income)
tibble [8,878 × 5] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 193 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : num [1:8878] 1960 1961 1962 1963 1964 ...
$ income : int [1:8878] 2740 2700 2680 2670 2650 2640 2600 2600 2620 2590 ...
$ log_income : num [1:8878] 7.92 7.9 7.89 7.89 7.88 ...
$ log_income_std: num [1:8878] 0.917 0.915 0.914 0.914 0.913 ...
summary(income)
country year income log_income log_income_std
Afghanistan : 46 Min. :1960 Min. : 312 Min. : 5.743 Min. :0.6651
Albania : 46 1st Qu.:1971 1st Qu.: 2250 1st Qu.: 7.719 1st Qu.:0.8939
Algeria : 46 Median :1982 Median : 5290 Median : 8.574 Median :0.9929
Andorra : 46 Mean :1982 Mean : 11344 Mean : 8.635 Mean :1.0000
Angola : 46 3rd Qu.:1994 3rd Qu.: 13575 3rd Qu.: 9.516 3rd Qu.:1.1020
Antigua and Barbuda: 46 Max. :2005 Max. :179000 Max. :12.095 Max. :1.4007
(Other) :8602
#future me helping past me
#fixing Eswatini to Swaziland (even though Eswatini is the correct name)
income$country <- as.character(income$country)
income[income == "Eswatini"] <- "Swaziland"
income$country <- as.factor(income$country)
Logging and standardizing is necessary here. Again, the argument is that wealth begets wealth, so income is logarithmic. From there just center the mean at 0 with a stdev of 1 for standardization.
Population data is also taken from Gapminder. The same process of logging and standardizing will be caried out here.
#population data also taken from Gapminder
population <- read.csv("/Volumes/RachelExternal/Thesis/Thesis/Data/Gapminder_population_total.csv")
#removing extra years
population <- population[, c(1, 162:207)]
#making longer data
population <-
population %>%
pivot_longer(!country, names_to = "year", values_to = "population")
#removing the Xs from the year strings
population$year <- str_remove_all(population$year, "[X]")
#fixing variable types
population$country <- as.factor(population$country)
population$year <- as.numeric(population$year)
#log and standardize, same argument as income: population begets population
population$log_pop <- log(population$population)
population <- population %>%
mutate(log_pop_std = log_pop / mean(log_pop))
summary(population)
country year population log_pop log_pop_std
Afghanistan : 46 Min. :1960 Min. :6.450e+02 Min. : 6.469 Min. :0.4316
Albania : 46 1st Qu.:1971 1st Qu.:9.045e+05 1st Qu.:13.715 1st Qu.:0.9151
Algeria : 46 Median :1982 Median :4.545e+06 Median :15.330 Median :1.0228
Andorra : 46 Mean :1982 Mean :2.404e+07 Mean :14.988 Mean :1.0000
Angola : 46 3rd Qu.:1994 3rd Qu.:1.320e+07 3rd Qu.:16.396 3rd Qu.:1.0939
Antigua and Barbuda: 46 Max. :2005 Max. :1.330e+09 Max. :21.008 Max. :1.4017
(Other) :8694
str(population)
tibble [8,970 × 5] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 195 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : num [1:8970] 1960 1961 1962 1963 1964 ...
$ population : int [1:8970] 9000000 9170000 9350000 9540000 9740000 9960000 10200000 10400000 10600000 10900000 ...
$ log_pop : num [1:8970] 16 16 16.1 16.1 16.1 ...
$ log_pop_std: num [1:8970] 1.07 1.07 1.07 1.07 1.07 ...
#future me helping past me again
#fixing Eswatini to Swaziland (even though Eswatini is the correct name)
population$country <- as.character(population$country)
population[population == "Eswatini"] <- "Swaziland"
population$country <- as.factor(population$country)
So some issues arise here with the fact that we have different country data for each dataframe. By using anti_join() I can see what wont be joined. Also this goes both ways, anti_join() returns rows of x that do not have a match in y.
isopop <- anti_join(iso, population, by = "country")
summary(isopop)
ISO country four_regions eight_regions
HKG :1 Hong Kong, China:1 africa :0 east_asia_pacific :2
TWN :1 Taiwan :1 americas:0 africa_north :0
AFG :0 Afghanistan :0 asia :2 africa_sub_saharan:0
AGO :0 Albania :0 europe :0 america_north :0
ALB :0 Algeria :0 america_south :0
AND :0 Andorra :0 asia_west :0
(Other):0 (Other) :0 (Other) :0
six_regions Latitude Longitude World.bank.region
america :0 Min. :22.29 Min. :114.2 East Asia & Pacific :2
east_asia_pacific :2 1st Qu.:22.71 1st Qu.:115.9 :0
europe_central_asia :0 Median :23.14 Median :117.6 Europe & Central Asia :0
middle_east_north_africa:0 Mean :23.14 Mean :117.6 Latin America & Caribbean :0
south_asia :0 3rd Qu.:23.57 3rd Qu.:119.3 Middle East & North Africa:0
sub_saharan_africa :0 Max. :24.00 Max. :121.0 North America :0
(Other) :0
popiso <- anti_join(population, iso, by = "country")
summary(popiso)
country year population log_pop log_pop_std
Afghanistan :0 Min. : NA Min. : NA Min. : NA Min. : NA
Albania :0 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Algeria :0 Median : NA Median : NA Median : NA Median : NA
Andorra :0 Mean :NaN Mean :NaN Mean :NaN Mean :NaN
Angola :0 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Antigua and Barbuda:0 Max. : NA Max. : NA Max. : NA Max. : NA
(Other) :0
Went back up and solved the issue of North Macedonia and Swaziland. There are still issues with Hong Kong and Taiwan as they are part of China. ISO has data for Hong Kong and Taiwan that population does not have.
isoinc <- anti_join(iso, income, by = "country")
summary(isoinc)
ISO country four_regions eight_regions
HKG :1 Holy See :1 africa :0 east_asia_pacific :2
HOS :1 Hong Kong, China:1 americas:0 europe_west :2
LIE :1 Liechtenstein :1 asia :2 africa_north :0
TWN :1 Taiwan :1 europe :2 africa_sub_saharan:0
AFG :0 Afghanistan :0 america_north :0
AGO :0 Albania :0 america_south :0
(Other):0 (Other) :0 (Other) :0
six_regions Latitude Longitude
america :0 Min. :22.29 Min. : 9.533
east_asia_pacific :2 1st Qu.:23.57 1st Qu.: 11.723
europe_central_asia :2 Median :32.95 Median : 63.306
middle_east_north_africa:0 Mean :33.84 Mean : 64.286
south_asia :0 3rd Qu.:43.22 3rd Qu.:115.868
sub_saharan_africa :0 Max. :47.17 Max. :121.000
World.bank.region
East Asia & Pacific :2
:1
Europe & Central Asia :1
Latin America & Caribbean :0
Middle East & North Africa:0
North America :0
(Other) :0
inciso <- anti_join(income, iso, by = "country")
summary(inciso)
country year income log_income log_income_std
Afghanistan :0 Min. : NA Min. : NA Min. : NA Min. : NA
Albania :0 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Algeria :0 Median : NA Median : NA Median : NA Median : NA
Andorra :0 Mean :NaN Mean :NaN Mean :NaN Mean :NaN
Angola :0 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Antigua and Barbuda:0 Max. : NA Max. : NA Max. : NA Max. : NA
(Other) :0
The issues here are with again, ISO has data for Hong Kong and Taiwan but also the Holy See and Liechtenstein and income does not have those.
popinc <- anti_join(population, income, by = "country")
summary(popinc)
country year population log_pop log_pop_std
Holy See :46 Min. :1960 Min. : 645 Min. : 6.469 Min. :0.4316
Liechtenstein:46 1st Qu.:1971 1st Qu.: 762 1st Qu.: 6.636 1st Qu.:0.4428
Afghanistan : 0 Median :1982 Median : 8702 Median : 8.259 Median :0.5510
Albania : 0 Mean :1982 Mean :13436 Mean : 8.392 Mean :0.5599
Algeria : 0 3rd Qu.:1994 3rd Qu.:26575 3rd Qu.:10.188 3rd Qu.:0.6797
Andorra : 0 Max. :2005 Max. :34700 Max. :10.454 Max. :0.6975
(Other) : 0
incpop <- anti_join(income, population, by = "country")
summary(incpop)
country year income log_income log_income_std
Afghanistan :0 Min. : NA Min. : NA Min. : NA Min. : NA
Albania :0 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Algeria :0 Median : NA Median : NA Median : NA Median : NA
Andorra :0 Mean :NaN Mean :NaN Mean :NaN Mean :NaN
Angola :0 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Antigua and Barbuda:0 Max. : NA Max. : NA Max. : NA Max. : NA
(Other) :0
Ok, population has some data for the Holy See and Liechtenstein, that income does not.
Alright, so I will left join ISO and population then again with income? Not sure that this will solve my problem, but lets try it.
#remove all these rando tables
rm(popinc, inciso, incpop, popiso, isoinc, isopop)
#
iso <- left_join(iso, population, by = "country")
iso <- left_join(iso, income, by = c("country", "year"))
# alright 197 countries! whoop!
summary(iso)
ISO country four_regions eight_regions
AFG : 46 Afghanistan : 46 africa :2484 africa_sub_saharan:2208
AGO : 46 Albania : 46 americas:1610 east_asia_pacific :1382
ALB : 46 Algeria : 46 asia :2624 asia_west :1242
AND : 46 Andorra : 46 europe :2254 europe_west :1150
ARE : 46 Angola : 46 europe_east :1104
ARG : 46 Antigua and Barbuda: 46 america_north :1058
(Other):8696 (Other) :8696 (Other) : 828
six_regions Latitude Longitude
america :1610 Min. :-42.00 Min. :-175.00
east_asia_pacific :1382 1st Qu.: 4.00 1st Qu.: -6.00
europe_central_asia :2484 Median : 17.33 Median : 21.00
middle_east_north_africa: 920 Mean : 19.11 Mean : 22.00
south_asia : 368 3rd Qu.: 40.00 3rd Qu.: 50.55
sub_saharan_africa :2208 Max. : 65.00 Max. : 179.14
World.bank.region year population log_pop
Europe & Central Asia :2392 Min. :1960 Min. :6.450e+02 Min. : 6.469
Sub-Saharan Africa :2208 1st Qu.:1971 1st Qu.:9.045e+05 1st Qu.:13.715
Latin America & Caribbean :1518 Median :1982 Median :4.545e+06 Median :15.330
East Asia & Pacific :1382 Mean :1982 Mean :2.404e+07 Mean :14.988
Middle East & North Africa: 966 3rd Qu.:1994 3rd Qu.:1.320e+07 3rd Qu.:16.396
South Asia : 368 Max. :2005 Max. :1.330e+09 Max. :21.008
(Other) : 138 NA's :2 NA's :2 NA's :2
log_pop_std income log_income log_income_std
Min. :0.4316 Min. : 312 Min. : 5.743 Min. :0.6651
1st Qu.:0.9151 1st Qu.: 2250 1st Qu.: 7.719 1st Qu.:0.8939
Median :1.0228 Median : 5290 Median : 8.574 Median :0.9929
Mean :1.0000 Mean : 11344 Mean : 8.635 Mean :1.0000
3rd Qu.:1.0939 3rd Qu.: 13575 3rd Qu.: 9.516 3rd Qu.:1.1020
Max. :1.4017 Max. :179000 Max. :12.095 Max. :1.4007
NA's :2 NA's :94 NA's :94 NA's :94
rm(income, population)
Now, join iso to aei with a merge (cause in this case the dynamics of merge are clearer for me). Hoping to preserve all data!
aei <- merge(aei, iso, by = c("ISO","country", "year"), all.x = TRUE)
rm(iso)
str(aei)
'data.frame': 1848 obs. of 21 variables:
$ ISO : Factor w/ 232 levels "","ABW","AFG",..: 2 2 2 2 2 2 2 2 3 3 ...
$ country : Factor w/ 232 levels "Afghanistan",..: 12 12 12 12 12 12 12 12 1 1 ...
$ year : num 1960 1970 1980 1985 1990 ...
$ ID : int 1 1 1 1 1 1 1 1 2 2 ...
$ aei_ha : num 0 0 0 0 0 ...
$ yearcount : num 0 10 20 25 30 35 40 45 0 10 ...
$ area_km : num 180 180 180 180 180 ...
$ irrperc : num 0 0 0 0 0 ...
$ irrfrac : num 0 0 0 0 0 ...
$ four_regions : Factor w/ 4 levels "africa","americas",..: NA NA NA NA NA NA NA NA 3 3 ...
$ eight_regions : Factor w/ 8 levels "africa_north",..: NA NA NA NA NA NA NA NA 5 5 ...
$ six_regions : Factor w/ 6 levels "america","east_asia_pacific",..: NA NA NA NA NA NA NA NA 5 5 ...
$ Latitude : num NA NA NA NA NA NA NA NA 33 33 ...
$ Longitude : num NA NA NA NA NA NA NA NA 66 66 ...
$ World.bank.region: Factor w/ 8 levels "","East Asia & Pacific",..: NA NA NA NA NA NA NA NA 7 7 ...
$ population : int NA NA NA NA NA NA NA NA 9000000 11200000 ...
$ log_pop : num NA NA NA NA NA ...
$ log_pop_std : num NA NA NA NA NA ...
$ income : int NA NA NA NA NA NA NA NA 2740 2570 ...
$ log_income : num NA NA NA NA NA ...
$ log_income_std : num NA NA NA NA NA ...
Solved the issue, 232 factors for both countries and ISO, meaning that we have been using the original aei file from Siebert as the main data frame to join to. Good. This is the data that needs to be maintained, NAs can pop up in other cols but not irrigated area.
But I still would like total GDP for later…
aei$income <- as.numeric(aei$income)
#mutating for total gdp and log gdp
aei <-
aei %>%
mutate(GDPtot = income*population) %>%
mutate(log_gdp_tot = log(GDPtot))
#creating the df to standardize without the NAs
aei2 <-
aei %>%
select(ISO, year, log_gdp_tot) %>%
na.omit() %>%
mutate(log_gdp_tot_std = log_gdp_tot/mean(log_gdp_tot)) %>%
select(ISO, year, log_gdp_tot_std)
#now rejoin!
aei <-
left_join(aei, aei2, by = c("ISO", "year"))
str(aei)
'data.frame': 1848 obs. of 24 variables:
$ ISO : Factor w/ 232 levels "","ABW","AFG",..: 2 2 2 2 2 2 2 2 3 3 ...
$ country : Factor w/ 232 levels "Afghanistan",..: 12 12 12 12 12 12 12 12 1 1 ...
$ year : num 1960 1970 1980 1985 1990 ...
$ ID : int 1 1 1 1 1 1 1 1 2 2 ...
$ aei_ha : num 0 0 0 0 0 ...
$ yearcount : num 0 10 20 25 30 35 40 45 0 10 ...
$ area_km : num 180 180 180 180 180 ...
$ irrperc : num 0 0 0 0 0 ...
$ irrfrac : num 0 0 0 0 0 ...
$ four_regions : Factor w/ 4 levels "africa","americas",..: NA NA NA NA NA NA NA NA 3 3 ...
$ eight_regions : Factor w/ 8 levels "africa_north",..: NA NA NA NA NA NA NA NA 5 5 ...
$ six_regions : Factor w/ 6 levels "america","east_asia_pacific",..: NA NA NA NA NA NA NA NA 5 5 ...
$ Latitude : num NA NA NA NA NA NA NA NA 33 33 ...
$ Longitude : num NA NA NA NA NA NA NA NA 66 66 ...
$ World.bank.region: Factor w/ 8 levels "","East Asia & Pacific",..: NA NA NA NA NA NA NA NA 7 7 ...
$ population : int NA NA NA NA NA NA NA NA 9000000 11200000 ...
$ log_pop : num NA NA NA NA NA ...
$ log_pop_std : num NA NA NA NA NA ...
$ income : num NA NA NA NA NA NA NA NA 2740 2570 ...
$ log_income : num NA NA NA NA NA ...
$ log_income_std : num NA NA NA NA NA ...
$ GDPtot : num NA NA NA NA NA ...
$ log_gdp_tot : num NA NA NA NA NA ...
$ log_gdp_tot_std : num NA NA NA NA NA ...
Finally, lets check out a quick histogram of gdp and pop,
hist(aei$log_gdp_tot_std, breaks = 100)
hist(aei$log_pop_std, breaks = 100)
lotanas <- aei[rowSums(is.na(aei)) > 0,]
lotanas %>%
group_by(country) %>%
select(income, population) %>%
summarise_all(funs(sum(is.na(.))))
The countries that dont have income also dont have population, so as least I am consistent in my missing data. I have to move on so I will just leave this here for a minute. Perhaps later we should fill these variables.
Add the rest of the predictors.
hist(aei$preciptot, breaks = 100)
hist(aei$precip_sc, breaks = 100)
hist(aei$precip_std, breaks = 100)
hist(aei$log_precip, breaks = 100)
hist(aei$log_precip_sc, breaks = 100)
hist(aei$log_precip_std, breaks = 100)
Huh.
#pull cft frac outta here
cft <-
cftandprecip %>%
select(-c(4)) %>% #removing precip
mutate_each(funs(.*cellaream2), c(4:35)) %>% #multiply all fractions with the cell area
group_by(ISO, year) %>% #group in to country and year
summarise_each(funs(sum)) %>% #take col sums for each crop
mutate(across(c(3:34), .fns = ~./cellaream2)) #divide col sums by total country area
#removing categories with 0
cft <- cft[, -c(18, 19, 26, 34, 35)]
aei <- merge(aei, cft, by = c("ISO", "year"), all.x = TRUE)
rm(cft, cftandprecip)
hist(aei$rugged_sc, breaks = 100)
Ok, so lets peek at how things change over time. Here is a graph of percentage of country area equipped for irrigation from 1960 - 2005.
Although this graph has a lot of countries and can get quite messy, some general trends appear. Asian countries seem to have high percentages of irrigated area per total land area (irrigation percentage). Then seemingly followed by European countries. Africa and Oceania do not seem to be irrigated as much. Be aware that you can these graphs are interactive, you may zoom, pan or click on the legend to isolate the regions. Although, a regional breakdown follows, just need to add some predictor variables first. Hang tight.
In general, Africa has low levels of irrigated area. Swaziland, Sao Tome and Principe and Mauritius lead here with majority of others falling below the levels of these countries.
Europe contains some countries that reach higher percentages of irrigated area. Countries such as the Netherlands, Romania, Albania, and Azerbaijan have reached peaks in the past (roughly 10-15% irrigated area) but were in a decline towards the end our my study period. Other countries such as Italy and Portugal have had more stable rises with less down turn toward. Unexpectedly, Russia has a very low percentage of irrigated area, however at this point, the assumption is that either Russia’s enormous size cancels out its irrigated area, or that Russia receives enough precipitation for the crops it grows that irrigation is not necessary. Further analysis should reveal this.
In the Americas, the % of area irrigated by countries is lower than Asia or Europe. Peaks here are Cuba and the Dominican Republic at around roughly 6-8% of their total countries being irrigated. There are some islands (Barbados and St. Lucia) that have sharp increases in irrigation fraction, but this could be because they are islands and any irrigaion at all would induce the ration of irrigated land to total land spike. Counrties that have experienced a steady increase, but are not the leaders in overall irrigaiton fraction are countries that are big and produce a lot of produce (USA, Mexico, Chile) The general trends of irrigation increase seem to be upward with perhaps a slight leveling off toward the end of the study period (for some countries) but lacking the distinct downturn that is seen in Europe.
The East Asian countries are really high in % of irrigated area. The four countries that reach upwards of (roughly) 10% or more of their total area irrigated toward the end of the study period are Vietnam, North Korea, South Korea, and Thailand. Other countries that have high irrfracs are China, Japan and the Philipines. It is expected that irrigaion fraction would be high here due to the amount of rice produced.
Oceania, due to its large amount of islands, does not have a lot of irrigation, majority of countries have 0% irrigation. New Zealand (around 2%) and Australia (less than 1%) have some irrigation. Although this could have a multitude of reasons why irrigation is so low here, regardless of the fact that Australia and New Zealand are very populous countries with mouths to feed. Australia is very large and perhaps its enormity cancels out the irrigated land it does have.
Very few countries here, countries in this area are large. Very high levels of irrigation here coming from most countries. Bangladesh, Pakistan, and India are heavily irrigated compared to their land area.
Irrigation is also prevalent here (but not as much as in south asia). These countries are dry and therefore would need irrigation to have domestic food suplies. Lebanon, Iraq and Israel lead here. Oil producing countries have little irrigation, they are dry and their gdp is high, perhaps importation is the easiest option for them? ***